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Abstract 

We  study  the  computational,  communication,  and  scalability  characteristics  of  a  Compu¬ 
tational  Fluid  Dynamics  application,  which  solves  the  time  accurate  flow  field  of  a  jet  using 
the  compressible  Navier-Stokes  equations,  on  a  variety  of  parallel  architecture  platforms. 
The  platforms  chosen  for  this  study  are  a  cluster  of  workstations  (the  LACE  experimental 
testbed  at  NASA  Lewis),  a  shared  memory  multiprocessor  (the  Cray  YMP),  and  distributed 
memory  multiprocessors  with  different  topologies  —  the  IBM  SP  and  the  Cray  T3D.  We 
investigate  the  impact  of  various  networks  connecting  the  cluster  of  workstations  on  the  per¬ 
formance  of  the  application  and  the  overheads  induced  by  popular  message  passing  libraries 
used  for  parallelization.  The  work  also  highlights  the  importance  of  matching  the  memory 
bandwidth  to  the  processor  speed  for  good  single  processor  performance.  By  studying  the 
performance  of  an  application  on  a  variety  of  architectures,  we  are  able  to  point  out  the 
strengths  and  weaknesses  of  each  of  the  example  computing  platforms. 
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NASA  Contract  No.  NAS1-19480  while  the  second  author  was  in  residence  at  the  Institute  for  Computer 
Applications  in  Science  and  Engineering  (ICASE),  NASA  Langley  Research  Center,  Hampton,  VA  23681- 
0001. 
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1  Introduction 


Numerical  simulations  play  an  important  role  in  the  investigation  of  physical  processes  as¬ 
sociated  with  many  important  problems.  The  suppression  of  jet  exhaust  noise  is  one  such 
problem  which  will  have  a  great  impact  on  the  success  of  the  High  Speed  Civil  Transport 
plane.  The  radiated  sound  emanating  from  the  jet  can  be  computed  by  solving  the  full 
(time-dependent)  compressible  Navier-Stokes  equations.  This  computation  can,  however,  be 
very  expensive  and  time  consuming.  The  difficulty  can  be  partially  overcome  by  limiting  the 
solution  domain  to  the  near  field  where  the  jet  is  nonlinear  and  then  using  acoustic  analogy 
(see  [12])  to  relate  the  far-field  noise  to  the  near-field  sources.  This  technique  requires  obtain¬ 
ing  the  time-dependent  flow  field.  In  this  study  we  concentrate  on  such  flow  fields  near  the 
nozzle  exit.  We  solve  the  Navier-Stokes  equations  to  compute  time  accurate  flow  fields  of  a 
supersonic  axisymmetric  jet.  Our  code  is  computationally  very  intensive  and  requires  many 
hours  of  CPU  time  on  the  Cray  Y-MP.  With  the  advent  of  massively  parallel  processors  and 
networks  of  workstations  (NOWs),  scientists  now  have  the  opportunity  to  parallelize  compu¬ 
tationally  intensive  codes  and  reduce  turnaround  time  at  a  fraction  of  the  cost  of  traditional 
supercomputers.  Recognizing  this,  a  number  of  researchers  [5,  10,  14,  18]  have  studied  CFD 
(Computational  Fluid  Dynamics)  applications  on  specific  parallel  architectures.  Our  goal  in 
this  study  is  to  implement  the  numerical  model  derived  from  the  CFD  application  described 
above  on  a  variety  of  parallel  architectural  platforms. 

The  platforms  chosen  for  this  study,  all  from  the  NASA  Lewis  Research  Center,  represent 
a  spectrum  of  parallel  architectures  that  have  been  proposed  to  solve  computationally  in¬ 
tensive  problems:  a  shared  memory  vector  multiprocessor  (the  Cray  YMP),  two  distributed 
memory  multiprocessors  with  different  topologies —  the  IBM  SP  and  the  Cray  T3D,  and 
a  cluster  of  workstations  connected  via  many  networks  (the  Lewis  Advanced  Cluster  Envi¬ 
ronment  (LACE)  [9]  experimental  testbed).  One  important  architecture  that  has  not  been 
considered  in  our  study  is  cache-coherent,  massively  parallel  processors  typified  by  the  DASH 
architecture  [11]. 

An  earlier  paper  by  the  authors  presented  the  results  of  a  study  of  this  application  on 
LACE  [6].  This  paper  differs  from  the  earlier  one  in  two  important  aspects:  i)  It  is  compre¬ 
hensive  covering  a  gamut  of  architectures  while  the  other  examined  the  feasibility  of  NOW 
architectures  as  low  cost  alternatives  to  expensive  supercomputers  and  massively  parallel 
processors,  ii)  It  focuses  on  the  relationship  of  the  performance  results  to  the  computation 
and  communication  characteristics  of  the  application,  to  the  architectural  aspects  of  the  net¬ 
works  and  the  processing  nodes,  and  to  the  programming  tools.  We  have  not  laid  emphasis 
on  the  physical  aspects  of  the  application  or  the  the  details  of  the  numerical  model  as  we 
have  done  in  the  other  paper  in  keeping  with  the  readership  of  two  disparate  communities. 
For  the  sake  of  completeness,  however,  we  have  included  the  relevant  details  from  the  other 
paper  in  this  one. 

In  the  next  section  we  briefly  discuss  the  governing  equations  and  the  numerical  model  of 
the  application.  Section  3  has  a  discussion  of  the  parallel  architectures  used  in  the  study 
and  the  tools  used  for  parallelizing  the  application.  The  parallelization  of  the  application 
is  the  subject  of  Section  4.  Section  5  describes  the  experimental  methodology.  Section  6 
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presents  a  detailed  discussion  of  the  results.  The  paper  concludes  with  a  brief  discussion  of 
the  lessons  learned  from  this  study  and  the  issues  that  merit  further  investigation. 


2  Numerical  Model 


We  solve  the  Navier-Stokes  and  the  Euler  equations  to  compute  flow  fields  of  an  axisymmetric 
jet.  The  Navier-Stokes  equations  for  such  flows  can  be  written,  in  polar  coordinates  as 


where 


LQ  =  S 

dQ  &F  OG 

dt  dx  dr 


(  P  \ 


Q  =  r 


pu 

pv 


V  E  J 


pu  \ 

pu  tci  T  p 
puv  -  Txr 

puH  -  UTXx  ~  VTxr  -  KTX  J 


(  pv  \ 

puv  -  Txt 
G  —  r  2  , 

pv *  -Trr+P 

pvH  —  UTxt  —  VTrr  —  kTt  J 

(  0  \ 

5=  0 

P  —  Tee  ! 

V  0 

F  and  G  are  the  fluxes  in  the  x  and  r  directions  respectively,  and  S  is  the  source  term  that 
arises  in  the  cylindrical  polar  coordinates,  are  the  shear  stresses  and  kTj  are  the  heat 
fluxes.  In  the  above  equations  p,  p,  u,  v,  T,  e  and  H  denote  the  pressure,  density,  axial  and 
radial  velocity  components,  temperature,  total  energy  and  enthalpy. 

'We  use  the  fourth-order  MacCormack  scheme,  due  to  Gottlieb  and  Turkel  [4],  to  solve  the 
Navier-Stokes  and  the  Euler  equations.  This  scheme  uses  predictor  and  corrector  steps 
to  compute  time  accurate  solutions.  It  uses  one-sided  differences  (forward  or  backward)  to 
compute  spatial  derivatives  at  each  predictor  or  corrector  step.  For  the  present  computations, 
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the  operator  L  in  the  equation  LQ  =  S,  or  equivalently  Qt  +  Fz  +  Gr  =  S,  is  split  into  two 
one-dimensional  operators  and  the  scheme  is  applied  to  these  split  operators.  We  define  Li 
as  a  one-dimensional  operator  with  a  forward  difference  in  the  predictor  and  a  backward 
difference  in  the  corrector.  Its  symmetric  variant  L2  uses  a  backward  difference  in  the 
predictor  and  a  forward  difference  in  the  corrector.  The  predictor  step  in  LiQ  for  the  one- 
dimension  model/split  equation  Qt  =  Fz  +  S  is  written  as 

4  =  Q?  +  -  f?)  -  (K+2  -  K+i)}  +  a ts< 

and  the  corrector  step  as 

4”+1  =  |[Qi  +  0?  -  ^{7(3  -  Fi-i)  -  (Fi-i  -  F,-2)}  +  AtSi] 

Similarly  in  L2  the  predictor  step  is 

Qi  =  Qt-  ^{7 (IT  ~  FT- 1)  -  (Ft i  -  +  a tSt 

and  the  corrector  step  is 

Qi+l  =  \[Qi  +  Qni+  -  Fi+ 1)  -  (Fi+1  -  Fi+2)}  +  A tSi] 


This  scheme  becomes  fourth  order  accurate  in  the  spatial  derivatives  when  alternated  with 
symmetric  variants  [4].  For  our  computations,  the  one  dimensional  sweeps  are  arranged  as 

Qn+1  =  LlxLlrQn 
Qn+2  =  L2rL2xQn+1 

This  scheme  is  used  for  the  interior  points.  In  order  to  advance  the  scheme  near  boundaries, 
the  fluxes  are  extrapolated  outside  the  domain  to  artificial  points  using  a  cubic  extrapolation 
to  compute  the  solution  on  the  boundary.  We  use  the  characteristic  boundary  condition  at 
the  outflow.  In  our  implementation,  we  solve  the  following  set  of  equations  to  get  the  solution 
at  the  new  time  for  all  boundary  points. 


pt  -  pcut  =  0 

Pt  +  pcut  =  R2 
Pt  ~  c2pt  =  R3 
vt  =  R± 

where  Ri  is  determined  by  which  variables  are  specified  and  which  are  not.  Whenever  the 
combination  is  not  specified,  Ri  is  just  those  spatial  derivatives  that  come  from  the  Navier- 
Stokes  equations.  For  further  details  see  [6]. 
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Figure  1:  Axial  momentum  in  an  excited  axisymmetric  jet 

Let  r  be  the  radius  of  the  nozzle.  Figure  1  shows  a  contour  plot  of  axial  momentum  from  the 
solution  of  the  Navier-Stokes  equations  for  a  domain  of  size  50r  in  the  axial  direction  and 
5 r  in  the  radial  direction  with  a  250  x  100  grid.  The  size  of  the  domain  and  the  chosen  grid 
represent  a  reasonable  problem  size.  The  result  was  obtained  after  about  16,000  time  steps. 
For  all  other  results  in  this  paper,  we  have  used  the  same  grid,  but  run  the  experiments  for 
5000  time  steps  to  keep  the  computing  requirements  reasonable. 


3  Parallel  Computing  Platforms 

This  section  contains  a  brief  discussion  of  the  various  platforms  used  in  the  study  together 
with  the  parallelization  tools  used. 


3.1  NOW 

The  LACE  testbed  is  regularly  upgraded.  The  present  configuration  has  32  RS6000  processor 
nodes  (nodes  1-32)  and  an  RS6000/Model  990  (node  0)  which  is  the  file  server.  These 
nodes  or  subsets  of  them  are  connected  through  various  networks  with  different  speed  and 
connection  characteristics.  All  the  nodes  are  connected  through  two  Ethernet  networks  (10 
Mbits/sec  (Mbps));  one  of  them  is  for  general  use  and  the  other  is  dedicated  to  “parallel” 
use.  Nodes  9-24  are  interconnected  through  a  FDDI  interface  with  a  peak  bandwidth  of  100 
Mbps.  It  is  convenient,  for  our  purposes,  to  consider  the  nodes  to  be  partitioned  into  a  lower 
half  (nodes  1-16)  and  an  upper  half  (nodes  17-32).  The  lower  half  has  RS6000/Model  590 
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CPUs  (the  CPU  has  a  66.5  MHz  clock,  256KB  data-  and  32KB  instruction  caches)  with  the 
following  networks  interconnecting  the  nodes:  an  ATM  network  capable  of  a  peak  bandwidth 
of  155  Mbps  and  IBM’s  ALLNODE  switch,  referred  to  as  ALLNODE-F  (for  fast),  capable 
of  a  peak  throughput  of  64  Mbps  per  link.  The  upper  half  has  the  slower  RS6000/Model 
560  CPUs  (the  CPU  has  a  50  MHz  clock,  64KB  data-  and  8KB  instruction  caches)  and 
is  connected  through  IBM’s  ALLNODE  prototype  switch,  referred  to  as  ALLNODE-S  (for 
slow),  capable  of  a  peak  throughput  of  32  Mbps  per  link.  The  ALLNODE  switch  is  a  variant 
of  Omega  interconnection  network  and  is  capable  of  providing  multiple  contentionless  paths 
between  the  nodes  of  the  cluster  (a  maximum  of  8  paths  can  be  configured  between  source 
and  destination  processors).  The  present  setup  does  not  permit  the  use  of  more  than  16 
processors  using  the  faster  networks.  The  nodes  have  varying  main  memory  capacity  (64 
MB,  128  MB,  256  MB,  and  512  MB).  We  have  used  the  popular  PVM  (Parallel  Virtual 
Machine)  message  passing  library  (version  3.2.2)  to  implement  our  parallel  programs.  We 
will  refer  to  the  LACE  cluster  with  RS6000/Model  560  processors  as  the  LACE/560  and 
those  with  the  RS6000/Model  590  processors  as  the  LACE/590. 

3.2  Shared  Memory  Architecture 

We  used  the  Cray  Y-MP/8,  which  has  eight  vector  processors,  for  this  study.  The  Cray 
Y-MP/8  has  a  peak  rating  of  approximately  2.7  GigaFLOPS.  It  offers  a  single  address  space 
and  the  communication  between  processes  executing  on  different  processors  is  through  shared 
variables.  We  parallelized  the  application  by  using  explicit  DO  ALL  directives  in  addition  to 
exploiting  the  features  of  the  parallelizing  compiler  on  the  Cray. 


3.3  Distributed  Memory  Architecture 

We  parallelized  the  application  on  two  distributed  memory  multiprocessors-  the  IBM  SPl 
and  the  Cray  T3D.  The  IBM  SPl  has  16  processing  nodes  (the  CPU  in  each  node  is  a 
RS6K/370  with  a  50  MHz  clock,  32KB  data  and  instruction  caches).  The  original  system 
has  been  software  upgraded  to  make  it  function  like  a  SP2.  We  will  refer  to  this  system 
as  the  IBM  SP  in  the  paper.  The  nodes  of  the  SP  are  interconnected  through  a  variant  of 
the  Omega  network  [17].  This  network,  similar  in  topology  to  ALLNODE,  permits  multiple 
contentionless  paths  between  nodes.  We  parallelized  the  application  using  MPL  (Message 
Passing  Library),  IBM’s  native  message  passing  library  and  PVMe,  a  customized  version  of 
PVM  (version  3.2)  developed  by  IBM  for  the  SP. 

The  Cray  T3D  is  also  a  distributed  memory  multiprocessor  with  the  topology  of  a  three 
dimensional  torus  [15].  The  machine  used  in  our  study  has  64  nodes  (8x4x2)  (each  node 
has  a  CPU  with  a  clock  speed  of  150  MHz  and  a  direct  mapped  cache  of  8KB),  of  which 
only  16  were  available  in  single  user  mode.  Though  the  T3D  supports  multiple  programming 
models,  we  programmed  the  machine  using  the  message  passing  paradigm  resorting  to  Cray’s 
customized  version  of  PVM  (version  3.2). 
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4  Parallelization 


The  factors  which  affect  performance  are  listed  below. 

1.  Single  processor  performance:  we  will  explain  various  optimizations  which  resulted  in 
80%  improvement  in  performance. 

2.  Communication  cost:  this  cost  depends  on  both  the  number  of  communication  startups 
and  the  volume  of  data  communicated.  Usually,  the  startup  cost  is  2-3  orders  of  magnitude 
higher  than  the  per  item  (usually  byte)  transfer  cost.  One  method  to  reduce  the  effect  of 
startup  cost  is  to  group  data  to  be  communicated  into  long  vectors. 

3.  Overlapped  communication  and  computation:  it  is  desirable  that  communication  be 
overlapped  with  computation  as  far  as  possible.  Increasing  the  amount  of  overlapping, 
however,  could  lead  to  finer  granularity  of  communication  which  then  leads  to  a  higher 
number  of  startups. 

4.  Bursty  communication:  such  communication  could  overwhelm  the  network’s  throughput 
capacity  temporarily  leading  to  increased  communication  cost  and  process  waiting  time. 
Some  amount  of  burstiness  is  inevitable  since  parallel  programs  are  usually  written  in  the 
SPMD  (single  program  multiple  data)  style.  There  is  also  usually  an  inverse  relationship 
between  bursty  data  and  the  number  of  communication  startups. 

Prom  the  above  discussion  it  is  clear  that  there  is  a  subtle  relationship  among  communication 
startup  cost,  overlapping  communication  with  computation,  and  bursty  communication. 

For  the  solution  of  Navier-Stokes  equations,  hereafter  referred  to  as  Navier-Stokes,  each 
internal  subdomain  exchanges  its  two  flux  values,  velocity,  and  temperature  along  the  bound¬ 
ary  with  its  appropriate  (left  or  right)  neighbor.  To  reduce  the  number  of  communication 
startups,  we  group  communication-  first,  all  the  velocity  and  temperature  values  along  a 
boundary  are  calculated  and  then  packaged  into  a  single  send.  We  use  a  similar  scheme  for 
the  flux  values  that  need  to  be  communicated. 

The  computational  and  communication  requirements  of  the  application  are  shown  in  Tables 

1  and  2.  It  is  seen  that  the  solution  of  Euler  equations,  hereafter  referred  to  as  Euler,  has 
roughly  50%  of  the  computation  the  communication  volume  requirements  of  Navier-Stokes 
although  the  computation  to  communication  volume  ratio  is  about  the  same.  Note  that  the 
communication  requirements  are  shown  on  a  per  processor  basis.  To  give  some  idea  of 
the  effects  of  communication,  consider  Navier-Stokes  to  be  executing  on  a  network  of  10 
workstations  connected  via  Ethernet.  Assume  a  reasonable  throughput  of  20  MFLOPS  per 
processor  and  the  maximum  throughput  of  10  Mbps  for  Ethernet.  The  computation  time 
will  then  be  approximately  725  seconds  (145,000/(10  x  20))  while  a  lower  bound  on  the 
communication  time,  ignoring  the  effect  of  startups,  is  1000  seconds  (1000  x  10/10)!  Table 

2  shows  the  ratio  of  computation  to  communication  for  the  application  in  units  of  floating 
point  operations/byte  transferred  per  processor  and  floating  point  operations/startup  per 
processor. 

The  application  is  parallelized  by  decomposing  the  domain  by  blocks  along  the  axial  direc¬ 
tion  only.  Two  dimensional  partitioning  was  not  attempted  since  a  simple  analysis  shows 
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Table  1:  Application  Characteristics 


Appln 

Total  Comp. 

(in  FP  Ops  (x  106)) 

Comm./Processor  , 

Start-ups 

Volume  (MB) 

N-S 

145,000 

80,000 

120  (960Mb) 

Euler 

77,000 

20,000 

64  (512Mb) 

Table  2:  Computation-Communication  Ratios 


No.  ofProcs. 

FPs/Byte 

FPs/Start-up 

Nav-Stokes 

Euler 

Nav-Stokes 

Euler 

2 

604 

601 

906K 

1925K 

4 

302 

301 

453K 

963K 

8 

151 

150 

227K 

481K 

16 

76 

75 

113K 

241K 

that  for  the  chosen  grid  size,  such  a  partioning  performs  worse  than  a  1-D  block  partitioning. 
For  example,  with  a  2-D  partitioning  of  16  processors  (4x4  blocks),  the  ratio  of  the  number 
of  bytes  transferred  compared  to  1-D  partitioning  is  1.25.  This  ratio  will,  of  course,  decrease 
when  we  increase  the  problem  size.  Another  disadvantage  with  2-D  partitioning  is  that  the 
number  of  start-ups  is  higher.  For  the  above  example,  the  corresponding  ratio  for  the  two 
partionings  is  1.6;  this  ratio  does  not  decrease  with  the  problem  size.  Since  the  startup  cost 
dominates  the  transmission  cost  in  most  current  architectures  including  the  ones  used  in  this 
study  (the  ratio  is  highest  for  LACE  and  least  for  Cray  T3D)  and  the  average  transmission 
volume  per  startup  is  only  moderate  (Table  1),  we  did  not  experiment  with  2-D  partitioning. 

The  parallelization  on  the  Cray  Y-MP  was  done  differently  (it  was  much  easier  also)  since  it 
is  a  shared  memory  architecture:  we  did  some  hand  optimization  to  convert  some  loops  to 
parallel  loops,  used  the  DOALL  directive,  and  partitioned  the  domain  along  the  orthogonal 
direction  of  the  sweep  to  keep  the  vector  lengths  large  and  to  avoid  non-stride  access  to  most 
of  the  variables. 


5  Experimental  Methodology 

The  performance  indicator  is  the  total  execution  time.  All  experiments  were  conducted  in 
single  user  mode.  The  execution  times  reported  are  for  single  runs  since  we  found  that  the 
experiments  were  repeatable  with  negligibly  small  discrepancies.  For  example,  with  LACE, 
the  deviation  from  the  mean  is  about  1%  or  less  ([6]). 

Experiments  using  a  single  processor  were  done  on  an  IBM  RS6K  (Model  560)  workstation 
of  LACE.  The  performance  of  the  original  code  (Version  1)  for  both  applications  is  shown 
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in  Figure  2. 


Version 

Figure  2:  Execution  time  on  a  single  processor  (RS6000/560) 

We  found  that  most  parts  of  the  application  were  limited  by  the  poor  performance  of  the 
memory  hierarchy  involving  the  cache  and  the  main  memory.  Improved  cache  performance 
was  the  key  and  this  was  achieved  by  accessing  arrays  in  stride-1  fashion  wherever  possible 
(using  the  loop  interchange  optimization).  The  modified  program,  called  Version  3  (the 
optimizations  were  performed  in  a  different  order  than  presented  in  the  paper),  resulted  in 
this  version  running  faster  by  approximately  50%,  compared  to  Version  2.  We  experimented 
with  a  number  of  other  modifications,  the  following  of  which  yielded  some  improvement: 
better  register  usage  by  collapsing  multiple  COMMON  blocks  into  a  single  one  (Version  5), 
strength  reduction  (replace  exponentiations  by  multiplications  wherever  feasible-  Version  2), 
replace  division  by  multiplication  wherever  feasible  since  the  former  are  relatively  expensive 
(a  reduction  from  5.5  x  109  divisions  to  2.0  x  109  was  achieved-  Version  4).  All  these 
optimizations  yielded  an  overall  improvement  of  roughly  80%  (from  9.3  MFLOPS  to  16.0 
MFLOPS)  as  illustrated  in  Figure  2.  The  optimizations  were  incorporated  in  sequence  so 
that  Version  5  contains  all  the  above  mentioned  optimizations. 

We  parallelized  Version  5  on  different  computing  platforms  in  accordance  with  the  ideas 
presented  in  the  last  section.  On  each  platform,  we  measured  the  execution  time  as  a 
function  of  the  number  of  processors  (up  to  8  with  Cray  Y-MP,  up  to  16  with  LACE,  IBM 
SP,  and  Cray  T3D). 

We  have  studied  the  performance  of  LACE  with  four  networks  of  differing  characteristics 
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using  “off-the-shelf’  PVM  as  the  message  passing  library.  With  the  IBM  SP,  we  have  studied 
the  impact  of  parallelizing  the  application  with  two  message  passing  libraries-  IBM’s  native 
MPL  and  a  customized  version  of  PVM  called  PVMe. 

In  all  experiments,  wherever  feasible,  we  have  separated  the  execution  time  into  two  additive 
components:  processor  busy  time  and  non-overlapped  communication  time.  The 
processor  busy  time  is  itself  composed  of  the  actual  computation  time  and  the  software 
overheads  associated  with  sending  and  receiving  messages.  An  accurate  separation  of  these 
components  is  not  possible,  however,  unless  we  have  hardware  performance  monitoring  tools. 
The  non-overlapped  communication  time  could  also  include  the  idle  time  of  a  processor 
waiting  for  a  message. 

Version  5  of  the  application  does  not  make  any  special  attempts  to  overlap  communication 
with  computation.  Version  6  does  overlapping  by  computing  the  stress  and  flux  components 
of  the  interior  part  of  each  subdomain  while  the  processor  is  waiting  for  the  velocity  and 
temperature  vectors  from  its  neighbors.  Figure  3  shows  the  timeline  of  a  processor’s  activity 
for  both  these  versions.  As  mentioned  ear  Her,  the  two  “flux  columns”  nearest  each  boundary 
are  combined  into  a  single  send.  We  have  experimented  with  sending  the  flux  columns  one 
at  a  time  to  avoid  bursty  communication.  This  variant  is  called  Version  7. 


VERSION  5 


VERSION  6 


Calculate 
VEL,  TEMP 
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Send 

VEL,  TEMP 
Receive 
VEL,  TEMP 
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Send 
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Figure  3:  Timeline  of  processor  activity 

We  found  that  the  execution  time  improvement  with  Versions  6  and  7  were  either  minimal 
or  even  worse  in  many  experiments.  Hence  all  our  experiments  were  conducted  with  Version 
5.  We  do  mention,  however,  the  impact  of  these  versions  on  different  networks  of  LACE  in 
Section  6.1. 

The  next  section  presents  a  detailed  discussion  of  the  results  from  our  experiments. 
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6  Results 


The  execution  times  of  Navier-Stokes  and  Euler  have  been  plotted  as  a  function  of  the 
number  of  processors  for  each  computing  platform,  using  a  log-log  scale  to  facilitate  mean¬ 
ingful  presentation. 


6.1  Performance  of  LACE 


Figure  4:  Navier-Stokes  execution  time  on  LACE 

Figure  4  shows  the  performance  of  Navier-Stokes  on  different  networks  of  LACE-  ALLNODE- 
F,  ALLNODE-S,  and  the  upper-half  Ethernet.  The  performance  of  the  ATM  and  the  FDDI 
networks  are  almost  identical  with  ALLNODE-F  and  ALLNODE-S  respectively.  Hence  the 
performance  of  the  ATM  and  FDDI  networks  are  not  shown. 

The  close  performance  of  ALLNODE-F  and  ATM,  and  ALLNODE-S  and  FDDI  can  be 
attributed  to  the  following  reason:  the  slower  link  speed  of  ALLNODE  (64  Mbps/32  Mbps) 
is  balanced  by  its  ability  to  set  up  multiple  contention-free  paths  while  ATM  (155  Mbps) 
or  FDDI  (100  Mbps)  with  their  faster  links  do  not  permit  multiple  physical  paths  in  the 
network. 

With  ALLNODE,  the  execution  time  falls  almost  linearly  with  increasing  number  of  processors- 
sublinearity  effects  begin  to  show,  however,  beyond  12  processors.  ALLNODE-F  is  about 
70%-80%  faster  than  ALLNODE-S.  This  can  be  attributed  to  both  an  improved  network 
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(which  is  twice  as  fast)  and  the  superior  performance  of  the  590  model  (33%  faster  clock,  data 
and  instruction  caches  which  are  4  times  bigger,  and  memory  bus  which  is  4  times  wider  than 
the  560-  these  contribute  to  faster  instruction  execution,  better  cache  hit  ratios,  and  lower 
cache  miss  penalty  respectively).  Ethernet  performance  reaches  its  peak  at  8  processors- 
beyond  this,  the  communication  requirements  of  the  application  overwhelm  the  network. 
The  inability  of  Ethernet  to  handle  traffic  beyond  8  processors  is  shown  by  the  following 
simple  argument:  Table  2  shows  that  with  8  processors,  Navier-Stokes  produces  a  byte 
for  communication  after  it  has  completed  151  floating  operations  on  the  average.  Consider 
a  1  second  interval  and  each  processor  operating  at  20  MFLOPS.  During  this  interval,  each 
processor  produces  1.06  Mb  for  communication,  on  the  average.  This  translates  to  approx¬ 
imately  8.5Mbs  from  all  the  8  processors.  Ethernet  is  capable  of  supporting  10Mbps  peak- 
the  performance  seen  by  an  application  will  be  a  fraction  of  this  bandwidth,  however;  it  is 
not  surprising,  therefore,  the  performance  gets  steadily  worse  beyond  8  processors. 


Figure  5:  Components  of  execution  time  (Navier-Stokes;  LACE) 

Figure  5  aid  in  a  more  in  depth  analysis  of  the  performance  of  LACE.  The  execution  time  is 
separated  into  two  additive  components  as  explained  in  the  previous  section.  It  is  seen  that 
the  processor  busy  time  falls  linearly  with  the  number  of  processors.  With  Ethernet,  the 
non-overlapped  communication  time  increases  superlinearly  with  the  number  of  processors. 
With  both  ALLNODE  switches,  this  time  remains  steady  up  to  10  or  12  processors  beyond 
which  it  begins  to  rise.  The  difference  in  processor  busy  times  and  the  communication  times 
between  the  two  ALLNODE  configurations  can  be  attributed  to  the  superior  node  and  the 
network  respectively. 
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Figure  6:  Communication  optimization  (Navier- Stokes;  LACE) 


Figures  6  and  7  show  the  performance  of  Versions  5,  6,  and  7  with  Ethernet  and  ALLNODE-S 
(the  trends  are  similar  with  ALLNODE-F).  The  performance  of  Version  6  (with  overlapped 
communication  and  computation,  as  explained  in  Section  5)  is  very  close  to  that  of  Ver¬ 
sion  5  for  both  Ethernet  and  ALLNODE-S.  Overlapping  does  not  increase  the  number  of 
communication  startups.  With  Version  6,  since  computations  for  the  subdomain  have  to  be 
broken  into  separate  ones  for  the  interior  and  the  boundary  (only  the  former  computations 
can  be  overlapped  with  communication),  the  loop  setup  overheads  are  higher.  Further,  the 
cache  performance  also  degrades  slightly  due  to  loss  of  temporal  locality.  Consequently,  these 
overheads  offset  any  gain  due  to  overlapping. 

Version  7  attempts  to  reduce  bursty  communication  at  the  cost  of  increased  number  of 
communication  startups.  Not  surprisingly,  Ethernet  performs  better  with  Version  7  than 
with  Version  5.  The  performance  of  ALLNODE-S  is  appreciably  worse  than  Version  5, 
however.  Since  ALLNODE-S  can  handle  the  communication  requirements  of  the  application, 
reducing  bursty  communication  only  harms  the  performance  since  the  number  of  startups 
increase. 

6.2  Comparative  Performance 

Figures  8  and  9  show  the  performance  of  the  application  on  the  four  computing  platforms  we 
have  chosen  for  this  study —  LACE,  Cray  Y-MP,  Cray  T3D  and  IBM  SP.  The  performance 
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Figure  7:  Communication  optimization  (Euler;  LACE) 
of  LACE  is  reported  for  ALLNODE-F  and  ALLNODE-S. 

Surprisingly,  LACE,  even  with  ALLNODE-S,  outperforms  SP  even  though  the  former  uses 
off-the-shelf  PVM  and  the  latter  uses  MPL,  IBM’s  native  message  passing  library.  (Our 
version  of)  MPL  imposes  a  limit  on  the  number  of  (non-blocking)  send  primitives  that  can 
be  simultaneously  active-  this  limit  is  lower  than  the  requirements  of  the  application;  hence, 
we  were  forced  to  use  blocking  send  primitives.  We  suspect  this  to  be  one  contributing 
factor  to  the  relatively  poor  performance.  The  CPU  on  the  SP  is  intermediate  in  speed  (62.5 
MHz  clock)  between  the  560  (50  MHz)  and  the  590  (66.6MHz).  Another  contributor  to  the 
poor  performance  of  the  SP  is  the  relatively  small  size  of  the  data  cache  (32KB  compared 
to  64KB  on  LACE/560  and  256KB  on  LACE/590). 

For  a  comparison  of  ALLNODE-F  and  ALLNODE-S,  see  Section  6.1  (Figures  4  and  5). 

Another  surprising  result  is  the  relatively  poor  performance  of  Cray  T3D  which  is  consistently 
worse  than  ALLNODE-F  and  is  worse  than  ALLNODE-S  for  less  than  8  processors.  The 
T3D’s  CPU  has  a  peak  rating  which  is  2.3X  and  3X  the  rating  of  the  590  and  560  models, 
respectively.  We  attribute  the  T3D’s  poor  performance  to  the  small  direct-mapped  cache 
of  8KB  size  (both  the  560  and  590  have  4-way  set-associative  date  caches  of  sizes  64KB 
and  256  KB  respectively;  in  addition  they  have  2-way  set  associative  instruction  caches  of 
sizes  8KB  and  32KB).  Poor  single-processor  performance  on  the  T3D  has  also  been  reported 
elsewhere  [17].  These  results  stress  the  importance  of  superior  cache  design  to  the  overall 
performance.  A  reasonably  fast  CPU  with  a  large,  set  associative  cache  and  a  high  bandwidth 
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Figure  8:  Execution  time  of  Navier-Stokes  on  computing  platforms 


Number  of  Processors 

Figure  9:  Execution  time  of  Euler  on  computing  platforms 


Table  3:  Speedup  Characteristics 


No.  ofProcs. 

Architectural  Platform 

ALLNODE-S 

ALLNODE-F 

IBM  SP 

Cray  T3D 

4 

3.2 

3.4 

3.8 

3.9 

16 

7.5 

7.9 

13.3 

14.6 

bus  connecting  the  cache  and  main  memory  performs  superior  to  a  much  faster  CPU  with 
poorly  designed  CPU-cache  and  cache-main  memory  interfaces. 

Table  3  shows  the  speedup  characteristics  of  Navier-Stokes  with  various  architectures. 
The  speedups,  measured  relative  to  the  single  processor  performance  of  the  corresponding 
architecture,  are  shown  with  4  processors  and  16  processors  to  illustrate  the  network  charac¬ 
teristics  of  each  architecture.  Both  T3D  and  SP  exhibit  very  good  speedup  characteristics, 
with  almost  linear  speedup,  indicating  that  the  corresponding  networks  can  sustain  the  com¬ 
munication  requirements  of  the  application.  Though  both  the  ALLNODE  networks  have  a 
reasonable  speedup  at  4  processors,  they  degrade  rapidly  and  the  speedup  is  only  modest 
at  16  processors.  It  is  only  reasonable  to  expect  this  flattening  trend  to  continue  with  in¬ 
creasing  number  of  processors  on  NOW  architectures.  Not  surprisingly,  ALLNODE-F  has  a 
slightly  better  speedup  than  ALLNODE-S.  Also,  observe  from  Figures  8  and  9  that  beyond 
8  processors,  T3D  with  its  superior  network  speed  (150  MB/sec  peak  transfer  rate  and  a 
relatively  small  setup  cost)  performs  better  than  ALLNODE-S. 

Cray  Y-MP  has  by  far  the  best  performance.  The  execution  time  shown  is  the  connect  time 
in  single  user  mode  (this  includes  the  I/O  time  also  which  we  were  not  able  to  separate 
from  the  computation  time).  Considering  the  effect  of  connect  time  also,  the  Y-MP  scales 
modestly  achieving  a  speedup  of  about  3.5  with  8  processors  for  both  the  applications.  With 
a  profiling  tool  which  simulates  the  execution  of  an  8-node  Y-MP  on  a  single  node  (and  does 
not  include  the  I/O  overheads),  we  obtained  a  speedup  of  7.1.  Observe  that  the  LACE/590 
with  16  processors  is  about  8%  faster  than  a  single  node  of  the  Y-MP. 

With  architectures  which  use  message  passing  libraries,  the  relatively  poor  performance  can 
be  attributed  to  large  setup  overheads  and  the  resulting  increase  in  processor  waiting  times 
with  increasing  number  of  processors.  These  overheads  arise  mainly  from  the  multiple  times 
that  data  to  be  communicated  is  copied  and  from  the  context  switching  overheads  that  arise 
in  transferring  a  message  between  the  application  level  and  the  physical  layer  of  the  network 
for  transmission  or  reception.  If  NOW  architectures  are  to  be  feasible  as  massively  parallel 
processors,  it  is  clear  that  both  the  interconnection  network  and  the  message  passing  library 
be  implemented  efficiently.  Such  effort  is  already  under  way  [1]. 
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6.3  Comparison  of  Message  Passing  Libraries 


Figures  10  and  11  compare  the  performance  of  the  PVMe  and  the  MPL  message  passing 
libraries  on  the  SP —  the  execution  times  have  been  separated  into  non-overlapping  compu¬ 
tation  and  communication  components.  The  graphs  show  that  MPL  is  consistently  faster 


Figure  10:  Comparison  of  MPL  and  PVMe  (Navier-Stokes;  IBM  SP) 

than  PVMe  by  approximately  75%  for  Navier-Stokes  and  approximately  40%  for  Euler. 
Observe  also  that  the  amount  of  non-overlapped  communication  is  not  only  negligibly  small 
but  that  it  decreases  with  the  number  of  processors  though  the  actual  communication  in¬ 
creases.  This  is  an  interesting  phenomenon  since  it  implies  that  there  is  increased  overlapping 
of  computation  and  communication  with  the  number  of  processors.  Note  however  that  the 
computation  part  also  includes  the  setup  overheads  of  communication.  This  phenomenon  is 
not  seen  in  case  of  LACE  (see  Figure  4)  where  the  non-overlapped  communication  increases, 
further  attesting  to  our  previous  observation  that  the  MPL  (and  PVMe)  library  does  not 
perform  as  well  as  PVM  does  on  LACE. 


6.4  Load  Balancing 

Finally,  how  well  is  the  application  load  balanced?  The  amount  of  computation  for  the  appli¬ 
cation  is  evenly  distributed  but  this  may  not  always  translate  to  a  load  balanced  execution. 
We  were  able  to  measure  the  processor  busy  times  (this  time  does  not  include  the  processor 
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Figure  11:  Comparison  of  MPL  and  PVMe  (Euler;  IBM  SP) 

waiting  time)  for  Navier-  Stokes  on  each  processor  of  the  SP.  Figure  12  shows  that  we  were 
able  to  achieve  almost  perfect  load  balancing. 


7  Discussion  and  Conclusion 

In  this  paper  we  have  studied  the  computational,  communication,  and  scalability  characteris¬ 
tics  of  a  typical  CFD  application  on  a  variety  of  architectural  platforms.  The  study  indicates 
that  NOW  architectures  have  the  potential  to  be  cost-effective  parallel  architectures  if  the 
networks  are  made  reasonably  fast  and  message  passing  libraries  are  efficiently  implemented 
to  circumvent  the  traditional  overheads  involved  in  transferring  a  message  between  the  ap¬ 
plication  level  and  the  physical  layer  of  the  network. 

The  study  also  highlights  the  importance  of  single  processor  performance  to  achieve  good 
performance.  With  fast,  off-the-shelf  RISC  processors  available,  the  bottleneck  seems  to  be 
the  performance  of  the  cache  and  the  memory  hierarchy.  A  proper  cache  design  is  critical  to 
good  performance.  We  believe  that  the  reason  for  relatively  poor  performance  of  the  T3D, 
in  spite  of  a  fast  processor,  is  the  small,  direct-mapped  cache. 

A  traditional  vector  multiprocessor  still  outperforms  multiprocessors  of  modest  to  medium 
size.  Parallelizing  an  application  using  message  passing  libraries  is  rather  tedious  and  even 
error-prone  but  with  distributed  memory  multiprocessors,  this  effort  is  worthwhile  since  good 
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Figure  12:  Processor  busy  times  (Navier-Stokes;  IBM  SP) 
scalability  is  achievable. 

Resource  limitations  have  forced  us  to  limit  our  study  to  16  processors.  We  hope  to  extend 
the  study  to  larger  multiprocessors  and  to  other  parallelization  tools  as  resources  become 
available.  For  reasons  mentioned  in  Section  4,  we  have  not  explored  decomposition  along 
both  the  axial  and  radial  directions.  A  future  goal  is  to  conduct  the  study  for  a  larger  domain 
to  understand  the  physics  of  the  problem  better  and  a  finer  mesh  to  compute  the  jet  noise 
directly  from  the  flow  field.  We  plan  to  explore  the  effects  of  2-D  partitioning  with  a  larger 
domain  and  a  finer  mesh. 
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